AI GPU Ollama Continue Dev

Pull Ollama Model

# Pull the 8-bit (30GB) instruction-tuned model
ollama pull gemma3:27b-it-q8_0
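Once the pull finishes, it is worth confirming the tag is present and inspecting its defaults before layering a Modelfile on top (assuming a stock local Ollama install on port 11434):

ollama list                      # the ~30GB gemma3:27b-it-q8_0 tag should appear
ollama show gemma3:27b-it-q8_0   # reports context length and default parameters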

Ollama Modelfile

Modelfile.gemma-coder-v1

# Start from our new, higher-fidelity 8-bit model
FROM gemma3:27b-it-q8_0

# === Enhancement 1: Max Out The Context Window ===
# Gemma 3 27B supports 128K (131072) tokens.
# Maxing this out gives RAG tools like Continue.dev the
# largest possible "workspace" to feed your project's context.
PARAMETER num_ctx 131072

# === Enhancement 2: Advanced Coherence Sampling (Mirostat) ===
# We are replacing the default sampling controls (temperature,
# top_k, top_p) with Mirostat 2.0. Mirostat targets a specific
# level of "surprise" (perplexity), which is often superior for
# long, logical, multi-step outputs like code files.
# It "learns" as it generates to stay on topic.
PARAMETER mirostat 2           # Use Mirostat 2.0
PARAMETER mirostat_tau 5.0     # Controls the target coherence (5.0 is default)
PARAMETER mirostat_eta 0.1     # Controls the "learning rate" (0.1 is default)

# (We no longer need temperature, top_k, or top_p)

# === Enhancement 3: Explicit Stop Tokens ===
# Tell the model exactly when its "turn" is over.
# This prevents it from rambling or role-playing as the user.
# "<end_of_turn>" is the specific token used by Gemma 3.
PARAMETER stop "<end_of_turn>"
PARAMETER stop "<|user|>"
PARAMETER stop "User:"

# === Enhancement 4: "v2" System Prompt ===
# We are making the prompt more aggressive by forcing
# a Chain-of-Thought and Self-Correction loop.
# We also add specific formatting instructions for multi-file code.
SYSTEM """
You are an expert-tier AI-pair programmer and software architect.
You will be given a task and context from a codebase.
Your response must be 100% code, configuration, or the requested artifact.
Do not add conversational fluff, greetings, or explanations.

Follow this multi-step reasoning process:
PHASE 1: Think step-by-step. Break down the user's request into a concrete plan.
PHASE 2: Execute the plan. Write the code or configuration to satisfy the request.
PHASE 3: Review and self-correct. Check your output for errors, optimizations, and adherence to the "100% code" rule.

When you must provide code for multiple files, use this *exact* markdown format:

--- path/to/your/file.py ---
```python
# ... your code for file.py ...
```

--- path/to/another/file.json ---
```json
{
  "key": "value"
}
```

Begin execution.
"""

Ollama Customized Model

    ollama create gemma-coder-v1 -f ./Modelfile.gemma-coder-v1
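After the build, a quick smoke test confirms the parameters and system prompt were baked in (the prompt below is only an illustration):

ollama show gemma-coder-v1 --modelfile   # should echo num_ctx, mirostat, and the stop tokens
ollama run gemma-coder-v1 "Write a Python function that reverses a string."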

Embedding Model

ollama pull mxbai-embed-large
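To confirm the embedding model responds, you can hit Ollama's embeddings endpoint directly (assuming the default port):

curl http://localhost:11434/api/embeddings -d '{
  "model": "mxbai-embed-large",
  "prompt": "The quick brown fox"
}'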

Continue Dev

Edit ~/.continue/config.yaml

# ~/.continue/config.yaml

models:
  - name: "Local Coder"
    title: "Gemma 3 27B Coder v1 (Q8)"
    provider: ollama
    model: "gemma-coder-v1:latest"
    apiBase: "http://localhost:11434"

  - name: "Local Embedder (SOTA)"
    title: "MixedBread Embed Large"
    provider: ollama
    # mxbai-embed-large is the Ollama tag for MixedBread's large embedding model
    model: "mxbai-embed-large:latest"
    apiBase: "http://localhost:11434"
    embed: true

# ... (rest of your config) ...

modelRoles:
  chat: "Local Coder"
  edit: "Local Coder"
      
contextProviders:
  - name: "code"
    params:
      "embeddingsProvider": "Local Embedder (SOTA)"
  - name: "docs"
  - name: "diff"
  - name: "terminal"        
# ... (rest of your config) ...

embeddingsProvider: "Local Embedder (SOTA)"
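A quick way to confirm Continue can reach Ollama and that both custom models are registered (assuming the default port):

curl http://localhost:11434/api/tags   # should list gemma-coder-v1 and mxbai-embed-large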

Domain-Tuning with (Q)LoRA

First, let's clarify the terms.

  • LoRA (Low-Rank Adaptation): This is the core technique. Instead of re-training the entire 30GB model, you "freeze" it and train a tiny new set of weights (the "adapter," often < 100MB). This adapter "steers" the model's output to be better at a specific task (e.g., writing only Rust code, or adopting a specific persona).

  • QLoRA (Quantized LoRA): This is an efficiency technique for training LoRAs. It quantizes the base model (e.g., to 4-bit) during the training process, which dramatically lowers the VRAM needed to perform the fine-tuning.

For your goal (applying a pre-trained adapter), you only need to care about the LoRA file itself.

How to Apply LoRA Adapters in Ollama

Ollama makes this incredibly simple using the ADAPTER instruction in a Modelfile. This "melds" the adapter with the base model at runtime.

Let's build a new model, gemma-coder-v1-rust, that's specifically tuned for Rust development.

  1. Step 1: Get a LoRA adapter. You'll find these adapters on Hugging Face. The key is to find adapters in GGUF format, as these are compatible with Ollama, and the adapter must have been trained against the same base model (Gemma 3 27B) or quality will degrade.

Go to: Hugging Face Collections (ggml-org)

Search for: adapters trained for your specific domain (e.g., "code," "python," "rust"). Let's pretend we found a hypothetical adapter file named rust-code-lora.gguf; a download sketch follows below.
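A minimal download sketch, assuming the adapter lives in a hypothetical Hugging Face repo named some-user/rust-code-lora-gguf (both the repo and the file name are placeholders):

# Requires the huggingface_hub CLI (pip install -U huggingface_hub)
huggingface-cli download some-user/rust-code-lora-gguf rust-code-lora.gguf --local-dir .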

  2. Step 2: Create a new "specialist" Modelfile named Modelfile.gemma-rust:

Modelfile.gemma-rust

# Start from our high-fidelity v1 coder model
FROM gemma-coder-v1

# === Apply the LoRA Adapter ===
# This is the magic. Ollama will load gemma-coder-v1
# and then "attach" this adapter on top of it.
# This must be a path to the .gguf file.
ADAPTER ./rust-code-lora.gguf

# We can even override the system prompt to be more specific
SYSTEM """
You are an expert-tier Rust and systems programming AI.
You will be given a task related to Rust, Cargo, or backend development.
Your response must be 100% code, configuration, or the requested artifact.
You must adhere to Rust idioms and best practices, paying close
attention to the borrow checker, memory safety, and performance.

Follow this multi-step reasoning process:
PHASE 1: Think step-by-step.
PHASE 2: Write the Rust code.
PHASE 3: Review and self-correct, checking for common Rust errors.

Begin execution.
"""
  3. Step 3: Build the Specialist Model
ollama create gemma-coder-v1-rust -f ./Modelfile.gemma-rust
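As with the base coder model, a quick smoke test (the prompt is only an illustration) verifies the adapter loads on top of the Q8 base:

ollama run gemma-coder-v1-rust "Write a Rust function that reverses a string in place."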

You now have another model in your ollama list. When you run gemma-coder-v1-rust, it will load the 8-bit base model and the Rust adapter, giving you hyper-specialized, domain-specific output. You can repeat this for Python, Go, TypeScript, etc., creating a small, efficient "specialist" model for every language in your stack.

Edit your Continue.dev config (~/.continue/config.yaml) to add gemma-coder-v1-rust.
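A minimal sketch of the entry to append under the existing models: list (the name and title are placeholders):

  - name: "Local Rust Coder"
    title: "Gemma 3 27B Rust Specialist (Q8)"
    provider: ollama
    model: "gemma-coder-v1-rust:latest"
    apiBase: "http://localhost:11434"

You can then pick "Local Rust Coder" from Continue's model dropdown when working in Rust.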

For NVIDIA GPUs (Linux/WSL):

Ollama auto-detects your NVIDIA driver (nvidia-smi) and CUDA, and by default it offloads as many layers as will fit in VRAM. The number of offloaded layers is a model option (num_gpu), not an environment variable: to force it, add PARAMETER num_gpu 99 to your Modelfile or pass num_gpu in the API request options. What the systemd service is useful for is controlling which GPU Ollama can see.

Open the service file for editing:

sudo systemctl edit ollama.service

This will open a blank override file. Paste the following:

[Service]
# Optional: if you have multiple GPUs, pin Ollama to the
# first one (index 0). Remove this line to let Ollama use all of them.
Environment="CUDA_VISIBLE_DEVICES=0"

Save the file, then reload and restart Ollama:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Verify: Run your model and watch your VRAM usage:

ollama run gemma-coder-v1 "Why is the sky blue?" &
watch -n 1 nvidia-smi

You should see your GPU's VRAM fill up and the "GPU-Util" climb.
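Once the model is loaded, ollama ps reports how the weights were split between CPU and GPU; with enough VRAM the PROCESSOR column should read 100% GPU:

ollama ps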